VOICE PERFORMANCE MEASUREMENTS


★ 

W.  J.  Hartman  and  L.  E.  Pratt 

Subjective  intelligibility  scores  are  determined  for 
various  combinations  of  signal-to-noise  ratios,  clipping, 
filtering, and pre-emphasis/de-emphasis. These are compared
with scores derived using linear predictive coding
(LPC).  An  overall  correlation  of  0.88  is  obtained  between 
the  two  sets  of  scores. 

Key  words:  Distortion,  intelligibility,  interference, 
objective  scoring. 

1. INTRODUCTION

An  objective  method  of  determining  articulation  score  was  developed  by  Gamauf 
and  Hartman  [1977].  The  purpose  of  this  paper  is  to  compare  the  results  of  this 
scoring with subjective (listener panel) scores for a variety of conditions
affecting the intelligibility. The original objective method reported in Gamauf and
Hartman  [1977]  is  modified  in  this  study,  the  main  modification  being  the  development 
of  a  hardware  word  alignment  device  which  is  described  in  Appendix  A. 

Previous  tests  for  which  the  objective  scoring  was  obtained  were  done  using  rf 
modulation  and  demodulation,  with  the  noise  and  distortion  introduced  at  the  rf 
frequency.  The  tests  reported  here  were  done  entirely  at  baseband. 

2.  VOICE  SCORING  METHODS 

In order to develop an objective intelligibility measure for corrupted speech,
a comparison must be performed between the distorted speech and the original
noise-free speech. A subjective intelligibility measure of the distorted speech must
also be available in order to judge the quality of the objective measure being used.
Both of these requirements are met by first making a noise-free master tape of
preselected speech, then sending it through the voice communication channels to be
tested, and making a recording of the speech at the channel output. The comparison
between subjective and objective scores is thus obtained from the same set of output
data.


The authors are with the U.S. Department of Commerce, National Telecommunications
and Information Administration, Institute for Telecommunication Sciences,
Boulder, CO 80303.


The preselected speech samples to be sent over a voice channel for intelligibility
scoring are phonetically balanced (PB) groups of isolated words, as opposed
to complete sentences or nonsense syllables. PB words were used because their
subjective scores have been shown to be repeatable, a necessary criterion
for this study because the objective measure is itself repeatable. Eight PB word
groups, each containing fifty isolated words, were selected as the test speech. The
resulting scores are called articulation scores (AS).

An analog tape containing all eight word groups and using both male and female
trained speakers was obtained from the Army Electronic Proving Ground Electromagnetic
Environment Test Facility (EMETF) at Fort Huachuca, Arizona. From this tape, a
master analog tape was made that would be sent over voice channels and later
compared with the recorded output of the channel. To perform this comparison,
the two tapes must be aligned, which means that synchronization information must be
included on the master tape before it is sent across the voice channel. Because the
tapes are digitized at a 10-kHz rate, the alignment procedure also had
to work in a digital format. It was found that a shift of plus or minus 10 samples
of a 256-sample analysis window caused the predictor coefficients to vary less than
0.1% in all cases. Therefore, the synchronization procedure was required
to align two segments of digitized speech to within 10 samples.
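
As a rough illustration of the predictor coefficients referred to above, the
following Python sketch computes LPC coefficients for one 256-sample window
using the autocorrelation method with the Levinson-Durbin recursion. The
predictor order of 10, the Hamming window, and all function names are
assumptions made for illustration; the text does not state which LPC
formulation or order was used.

```python
import math

WINDOW = 256  # analysis window length given in the text


def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of a real frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the given autocorrelation.

    Returns (a, err), where x[n] is predicted as sum(a[j] * x[n - 1 - j])
    and err is the final prediction error energy.
    """
    a = []
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                       # reflection coefficient
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= (1.0 - k * k)
    return a, err


def lpc(frame, order=10):
    """LPC coefficients of one frame (order and Hamming window are assumed)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    r = autocorrelation(windowed, order)
    coeffs, _ = levinson_durbin(r, order)
    return coeffs
```

A shift test of the kind described could then compare `lpc(signal[s:s + WINDOW])`
with `lpc(signal[s + 10:s + 10 + WINDOW])` for a digitized word `signal`; the
size of the resulting change depends on the speech material and is not
reproduced here.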

A synchronization procedure that was found to meet the required 10-sample
alignment specification made use of a binary pseudonoise (PN) sequence. A
length-127 binary PN sequence, generated at a 635-Hz clock rate, was sent through a
phase-continuous frequency shift keying modem using the two frequencies 1.2 kHz and
2.2 kHz. This PN signal was then placed before each word and after the last word
of all eight word groups, thereby creating the master analog tape with alignment
capabilities. In the previous work, Gamauf and Hartman [1977], the correlation
between PN sequences from the master tape and the tape from the system output was
done in a computer. For this study, a hardware correlation device described in
Appendix A was used for the alignment.
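
The synchronization signal described above can be sketched in Python as
follows. The feedback polynomial (x^7 + x + 1), the shift-register start
state, and the assignment of 1.2 kHz to a 0 bit and 2.2 kHz to a 1 bit are
assumptions for illustration only; the text gives the sequence length, clock
rate, and tone frequencies but not these details.

```python
import math

SAMPLE_RATE = 10_000              # Hz, digitizing rate from the text
BIT_RATE = 635                    # Hz, PN clock rate from the text
F_ZERO, F_ONE = 1_200.0, 2_200.0  # Hz; tone-to-bit mapping is assumed


def pn_sequence():
    """One period (127 bits) of a maximal-length sequence.

    Uses the recurrence a[n] = a[n-7] XOR a[n-6], an assumed choice of taps.
    """
    reg = [1, 0, 0, 0, 0, 0, 0]   # any nonzero start state works
    bits = []
    for _ in range(127):
        bits.append(reg[0])
        reg = reg[1:] + [reg[0] ^ reg[1]]
    return bits


def fsk_modulate(bits):
    """Phase-continuous FSK: the phase accumulates across bit boundaries."""
    n_samples = round(len(bits) * SAMPLE_RATE / BIT_RATE)
    phase = 0.0
    out = []
    for n in range(n_samples):
        bit = bits[min(n * BIT_RATE // SAMPLE_RATE, len(bits) - 1)]
        freq = F_ONE if bit else F_ZERO
        out.append(math.sin(phase))
        phase += 2.0 * math.pi * freq / SAMPLE_RATE
    return out


sync_bits = pn_sequence()
sync_wave = fsk_modulate(sync_bits)
```

With these rates one PN period lasts 127/635 = 0.2 s, which is 2000 samples
at the 10-kHz digitizing rate.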

3.  TEST  RESULTS 

Eight different test conditions, involving pre-emphasis/de-emphasis, channel
bandwidth filters, channel weighting, clipping, and interference, were set up in
accordance with the block diagram of Figure 1. Within each test condition, four
values of a parameter (e.g., the signal-to-noise ratio) were used to give a set
of 32 measurements. Table 1 gives the test and parameter values for the 32 tests.

Details of the test conditions are given in Appendix B.

Figure 1. Block diagram of the test configurations.


The  tapes  were  copied  and  sent  to  EMETF,  Fort  Huachuca,  Arizona,  for  the 
subjective  scoring. 

For the objective scoring, the output of the analog recorder was first
conditioned as shown in Figure 2 and then fed to both the PN sequence detector and
the digitizer as shown in Figure 3. Whenever the PN signal on the tape aligned
with a reference PN signal, the PN sequence detector printed the number of the
corresponding digital sample, allowing accurate word alignment. The digitized
signals were then processed according to the methods of Gamauf and Hartman [1977]
to obtain the objective score.

Table  2  gives  the  subjective  scores  and  objective  scores  for  all  the  tests. 

No  objective  scores  could  be  obtained  for  test  7  due  to  the  combination  of  the 
analysis  method  used  and  the  coherency  of  the  sine  wave.  A  modification  of  the 
method  to  resolve  this  problem  is  possible,  but  beyond  the  scope  of  the  present 
effort. 

The objective scores are plotted versus the subjective scores in Figure 4. The
cross-correlation coefficient between the two sets of scores is 0.88.
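
The reported coefficient can be checked directly against the Table 2 entries.
The Python sketch below assumes the coefficient is the ordinary Pearson
correlation over the 28 conditions that have both scores (test 7 is left out
because it has no objective scores); the text does not state the formula
used, so this is an illustrative check rather than the authors' procedure.

```python
import math

# (objective, subjective) score pairs from Table 2, tests 1-6 and 8.
SCORES = [
    (.03, .28), (.48, .50), (.58, .65), (.83, .81),   # test 1
    (.53, .71), (.53, .69), (.91, .87), (.58, .85),   # test 2
    (.34, .16), (.47, .23), (.49, .55), (.78, .79),   # test 3
    (.44, .23), (.56, .44), (.63, .60), (.82, .88),   # test 4
    (.10, .14), (.20, .24), (.56, .44), (.67, .80),   # test 5
    (.15, .06), (.25, .12), (.32, .23), (.58, .62),   # test 6
    (.04, .10), (.36, .46), (.68, .62), (.75, .71),   # test 8
]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


print(round(pearson(SCORES), 2))  # 0.88
```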

4.  DISCUSSION  OF  RESULTS 

The worst agreement between the objective and subjective scores appears to
be for the clipping-narrowband case (test 2, conditions 2 and 4), which is not
unexpected, and for the worst-case conditions that produced the lowest AS. It
is mildly surprising, however, that the scores agree so well for the case of
crosstalk (test 8), since LPC (linear predictive coding) voice systems do not perform
well with more than one talker at a time.

The  results  support  the  objective  measure  as  a  predictor  for  articulation 
score.  Moreover,  the  cost  reduction  achieved  using  the  hardware  word  alignment 
makes  the  process  attractive  for  testing  voice  systems. 

5.  REFERENCE 
Figure 2. Block diagram of the signal conditioning network for digitizing the voice tapes.


Figure  3.  Block  diagram  showing  the  positioning  of  the  PN 
sequence  detector  in  the  digitizing  operations. 


Table 2. Subjective and Objective Scores for the Tests Described in Table 1

Test #    Condition #    Objective    Subjective
                         Score/100    Score/100

  1           1             .03          .28
              2             .48          .50
              3             .58          .65
              4             .83          .81

  2           1             .53          .71
              2             .53          .69
              3             .91          .87
              4             .58          .85

  3           1             .34          .16
              2             .47          .23
              3             .49          .55
              4             .78          .79

  4           1             .44          .23
              2             .56          .44
              3             .63          .60
              4             .82          .88

  5           1             .10          .14
              2             .20          .24
              3             .56          .44
              4             .67          .80

  6           1             .15          .06
              2             .25          .12
              3             .32          .23
              4             .58          .62

  7           1             --           .52
              2             --           .80
              3             --           .76
              4             --           .83

  8           1             .04          .10
              2             .36          .46
              3             .68          .62
              4             .75          .71

APPENDIX A: CORRELATION DETECTOR

The correlation detector was designed and built by ITS personnel. Its function
is to find the point of highest correlation between an incoming signal and a stored
reference signal. Use of a 10-kHz sample rate gave a 100-µs accuracy.

Refer to the block and schematic diagrams (Figures A1, A2, A3). The standard
reference signal was digitized in 2048 samples, and only the polarity of the samples
was permanently stored as the image in programmable read-only memories (PROM). The
incoming analog signal is polarity detected and stored in random-access memory (RAM).
The contents of the PROM and RAM are compared bit by bit in the exclusive-OR gate
correlator and fed to the correlation counter. This is done 8 bits at a time in
parallel because the PROMs and RAMs will not operate at the necessary clock rate
of 20.56 MHz.

At the completion of each 2048-bit correlation, the output of the correlation
counter is sent to a buffer (latch) and to a digital comparator for comparison with
the current contents of the buffer. If the current count exceeds 1280 and is larger
than the previous count, the current count is stored in the buffer and the digitized
data sample number is stored in the sample counter. (This operation is shown on the
block diagram only.) At this time a new signal value is brought into the RAM and
the contents of the RAM are shifted by one to allow the correlator to operate on a
new relative position of the data and the stored standard. The 2048th bit is discarded
because it has served its function and is no longer needed. This action also starts
a 3-ms retriggerable time delay, which allows the circuit to keep looking for higher
correlation values to ensure that the highest value for the current word
synchronization has been found. At the end of this 3-ms delay, a second 3-ms pulse is
generated to cause the printer to print the digitized data sample number and to reset
the buffer so the correlator can start looking for synchronization of the next word.
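
The polarity-coincidence search described above can be imitated in software.
The Python sketch below is a simplified model, not the hardware design: it
keeps only the single best match in a stream and omits the 3-ms retriggerable
delay and per-word reset. The function names are hypothetical, and the demo
uses a random reference pattern with 10% bit errors purely as an assumed
stand-in for the stored PN image.

```python
import random

REF_LEN = 2048     # reference length in bits (from the text)
THRESHOLD = 1280   # minimum agreement count (from the text)


def polarity_bits(samples):
    """Keep only the sign of each sample, as the detector does."""
    return [1 if s >= 0 else 0 for s in samples]


def find_sync(incoming_bits, ref_bits, threshold=THRESHOLD):
    """Slide the reference over the incoming bits.

    Returns (sample_number, count) for the highest agreement count above
    the threshold, or None if the threshold is never exceeded.
    """
    n = len(ref_bits)
    mask = (1 << n) - 1
    ref = int("".join(str(b) for b in ref_bits), 2)
    window = 0
    best = None
    for i, bit in enumerate(incoming_bits):
        window = ((window << 1) | bit) & mask   # oldest bit is discarded
        if i + 1 < n:
            continue                            # window not yet full
        agree = n - bin(window ^ ref).count("1")  # XOR marks disagreements
        if agree > threshold and (best is None or agree > best[1]):
            best = (i, agree)
    return best


# Demo with assumed data: a noisy copy of the reference buried in random bits.
rng = random.Random(0)
reference = [rng.randint(0, 1) for _ in range(REF_LEN)]
noisy = [b ^ int(rng.random() < 0.1) for b in reference]
stream = ([rng.randint(0, 1) for _ in range(500)] + noisy
          + [rng.randint(0, 1) for _ in range(500)])
match = find_sync(stream, reference)
```

In this demo the reference ends at stream index 500 + 2048 - 1 = 2547, which
is the sample number the search is expected to report.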

Figures A1 to A3. Correlation detector.


APPENDIX  B:  TEST  CONFIGURATIONS 

The  8  test  conditions  outlined  in  Table  1  are  briefly  discussed  in  this 
section. 

(a)  Filters 

The filters used were 4-pole bandpass filters with 6-dB bandwidths
of 300 Hz to 3 kHz for the narrowband mode (N) and 200 Hz to 10 kHz for
the wideband mode (W). The F1A weighting is discussed in ITT (1969).

(b) Pre-emphasis/De-emphasis

Figure  B1  shows  the  results  of  several  measurements  of  the  6-dB/octave 
pre-emphasis,  the  6  dB  de-emphasis,  and  the  combined  effect.  Figure  B2 
shows  only  the  pre-emphasis  curve  for  the  3-dB/octave  setting. 

(c)  Noise 

The noise added to the signal was white Gaussian noise from a broadband
noise generator. The noise power in the received bandwidth was measured
on an rms voltmeter, as was the signal without noise. An additional
measurement of the ratio of signal plus interference plus distortion to
interference plus distortion was made using a 1-kHz tone with the same rms
voltage level as the speech, and using a notch filter with 60 dB attenuation
at 1 kHz. Figure B3 shows a block diagram of this system.

(d)  Clipping 

The clipping was accomplished using the network diagrammed in Figure B4.
The rms audio signal voltage was preamplified to 30 dB above the clipping
level. The signal plus interference plus distortion to interference
plus distortion ratio, (S+I+D)/(I+D), is determined by the harmonics, and
would be 7 dB if the clipping produced a square wave.
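
The 7-dB figure can be checked with a short calculation. The Python sketch
below assumes an ideal unit-amplitude square wave, takes the fundamental as
the signal removed by the notch filter, and treats all remaining harmonic
power as distortion with no added interference; these idealizations are
assumptions for illustration.

```python
import math


def square_wave_ratio_db():
    """(S+I+D)/(I+D) in dB for an ideal square wave with no interference."""
    total_power = 1.0                        # unit-amplitude square wave
    fundamental = 8.0 / math.pi ** 2         # power of the 4/pi-amplitude sine
    distortion = total_power - fundamental   # power left in the harmonics
    return 10.0 * math.log10(total_power / distortion)


print(round(square_wave_ratio_db(), 2))  # 7.23
```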

(e)  Crosstalk 

The intelligible crosstalk used was prerecorded speech taken from AM
radio broadcasts. The same segment was used during each of the four test
conditions and included three contiguous time frames, each with a different
speaker: one at the average level, one approximately 3 dB below the
average level, and one approximately 3 dB above the average level for
the entire segment.

(f) Sine Wave Interference

A 2-kHz sine wave was used as interference in test setup #7. The
interference ratios are the ratios of the rms voltages at the input
to the recorder.

(g)  Recording 

The voice signals were recorded on a 1/2-inch tape recorder at 15 ips.
The clear signal and the noisy signal were recorded on separate
tracks. These signals were then transcribed onto 1/4-inch tape
(reel-to-reel) for the subjective scoring, which was done by the
U.S. Army Electronic Proving Ground at Ft. Huachuca, AZ. The 1/2-inch
tapes were used in the data analysis described in Section 4.

Figure B1. Measurement of the pre-emphasis (L), de-emphasis (R), and combined (dashed)
characteristics for the 6-dB/octave above 1-Hz case. (The different symbols
on the (L) curve are measurements made on different days.)

Figure B2. Measurement of the 3-dB/octave pre-emphasis characteristics.

Figure B3. Block diagram for test #1. The notch filter and rms voltmeter
are used for the measurement of (S+I+D)/(I+D).


